Extraction of Unmarked Quotations in Newspapers: A Study Based on Direct Speech Extraction Systems
Abstract
This paper presents work in progress on automatically extracting quotation sentences from newspaper articles. The focus is the extraction and annotation of unmarked quotation sentences. A linguistic study shows that unmarked quotation sentences can be formalised into 16 patterns that can be used to develop an extraction grammar. The question of identifying the boundaries of unmarked quotations is also raised, as these boundaries are often ambiguous. An annotation scheme is defined that describes all the elements that can occur in a quotation sentence. The paper also presents the creation of two resources required by the system. First, a dictionary of verbs introducing quotations has been built automatically, using a grammar of marked quotation sentences to identify the verbs able to introduce quotations. Second, a grammar formalising the patterns of unmarked quotation sentences has been developed with Unitex, a tool based on finite-state machines. A short experiment on two patterns shows promising results.
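The dictionary-building idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: a single regular expression stands in for the Unitex grammar of marked quotations, the pattern covers only one hypothetical configuration (a quoted span followed by a verb and a capitalised name), and the English toy corpus and the function names `harvest_verbs` and `unmarked_candidates` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a grammar of marked quotations:
# a quoted span, then a lowercase word (the candidate verb), then a capitalised name.
MARKED = re.compile(r'"[^"]+"\s+([a-z]+)\s+[A-Z][a-z]+')


def harvest_verbs(sentences):
    """Count the words found in the verb slot of the marked-quotation pattern."""
    counts = Counter()
    for sentence in sentences:
        for match in MARKED.finditer(sentence):
            counts[match.group(1)] += 1
    return counts


def unmarked_candidates(sentences, verbs):
    """Return sentences that contain a harvested verb but no quotation marks."""
    selected = []
    for sentence in sentences:
        if '"' in sentence:
            continue  # marked quotation, already handled by the pattern above
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & verbs:
            selected.append(sentence)
    return selected


# Toy corpus (invented for this sketch).
corpus = [
    '"We will appeal," said Martin yesterday.',
    '"The vote is final," declared Dupont.',
    'The vote is final, declared Dupont.',
    'The committee met on Monday.',
]

verbs = set(harvest_verbs(corpus))
candidates = unmarked_candidates(corpus, verbs)
```

In this sketch the verbs harvested from the two marked quotations are reused to flag the third sentence, which has no quotation marks, as a candidate unmarked quotation. Deciding where such a quotation begins and ends is the boundary problem the abstract describes as ambiguous, and the sketch does not attempt it.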
Similar resources
Automatically Detecting and Attributing Indirect Quotations
Direct quotations are used for opinion mining and information extraction as they have an easy to extract span and they can be attributed to a speaker with high accuracy. However, simply focusing on direct quotations ignores around half of all reported speech, which is in the form of indirect or mixed speech. This work presents the first large-scale experiments in indirect and mixed quotation ex...
A Lexicon of French Quotation Verbs for Automatic Quotation Extraction
Quotation extraction is an important information extraction task, especially when dealing with news wires. Quotations can be found in various configurations. In this paper, we focus on direct quotations introduced by a parenthetical clause, headed by a “quotation verb”. Our study is based on a large French news wire corpus from the Agence France-Presse. We introduce and motivate an analysis at ...
Audio quotation marks for natural language understanding
Detecting the presence of quotations in speech is a difficult task for automatic natural language understanding. This paper presents a study on the correlation between three prosodic features present in a voice command and the presence or absence of quotations. These features consist of intra-word pause durations, F0 reset and F0 continuity. A combination of lexical and prosodic extraction tool...
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...
A review on EEG based brain computer interface systems feature extraction methods
The brain – computer interface (BCI) provides a communicational channel between human and machine. Most of these systems are based on brain activities. Brain Computer-Interfacing is a methodology that provides a way for communication with the outside environment using the brain thoughts. The success of this methodology depends on the selection of methods to process the brain signals in each pha...
Publication date: 2012